INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is less desirable for hotels and can diminish their revenue. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. The group is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

There are no missing values in this dataset.

Exploratory Data Analysis (EDA)

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

EDA Insights

Most of our data describes guests who have two adults in their party, have zero children, spend zero to two weekend nights and one to four week nights in the hotel, do not require a car space, have a lead time of less than 200 days, arrived in 2018, are not repeated guests, have no previous cancellations, have not previously booked with the hotel, stay in a $50 to $150 hotel room, and have zero or one special requests.

Data Preprocessing

EDA

Building a Logistic Regression model

There is only slight class imbalance.
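One way to put a number on class imbalance is to compute the share of each class in the target. The sketch below uses a made-up list of labels (not the notebook's data) and a hypothetical helper, `class_proportions`, purely for illustration.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only: 1 = not cancelled, 0 = cancelled.
toy_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
proportions = class_proportions(toy_labels)
print(proportions)
```

The closer the shares are to each other, the less the imbalance distorts accuracy as a metric.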

Checking Multicollinearity

Model performance evaluation

Our target variable y is the booking status: y = 1 when the customer does not cancel.

The odds of repeated guests not cancelling their reservation are 24.18 times the odds for non-repeated guests, holding the other variables constant.

Another way of viewing these results is as a percentage change in odds. For example, customers who reserve their room through the offline market segment have 815.05% higher odds of not cancelling, so these customers will most likely not cancel their reservation.
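Both figures above are statements about odds, not probabilities. As a minimal sketch, a logistic regression coefficient converts to an odds ratio via exp(coef) and to a percentage change in odds via (exp(coef) - 1) * 100. The coefficients below are back-calculated from the quoted figures rather than read from the model output, and the 0.5 baseline probability is a made-up value; the helper names are hypothetical.

```python
import math


def odds_ratio(coef):
    """Odds ratio for a one-unit increase in a predictor: exp(coefficient)."""
    return math.exp(coef)


def pct_change_in_odds(coef):
    """Percentage change in the odds for a one-unit increase in a predictor."""
    return (math.exp(coef) - 1.0) * 100.0


def odds_to_probability(odds):
    """Convert odds to a probability, which always lies between 0 and 1."""
    return odds / (1.0 + odds)


# Coefficients implied by the figures quoted in the text (assumed, not model output).
repeated_guest_coef = math.log(24.18)
offline_segment_coef = math.log(1.0 + 815.05 / 100.0)

print(round(odds_ratio(repeated_guest_coef), 2))
print(round(pct_change_in_odds(offline_segment_coef), 2))

# Hypothetical baseline: a guest with a 0.5 probability (odds of 1) of not cancelling.
baseline_odds = 1.0
new_probability = odds_to_probability(baseline_odds * odds_ratio(repeated_guest_coef))
print(round(new_probability, 4))
```

The last step shows why a percentage change in odds above 100% is not a contradiction: the resulting probability still stays below 1.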

Final Model Summary

We reduced the conflicts within the model by dropping misleading variables. The final model is also more accurate than the original model, judging by a comparison with the original confusion matrix.

The model is performing well on the training set, judging by the area under the curve (AUC) and the shape of the curve.

The training-set metrics are close together in value, and the model correctly predicts almost 80% of the bookings, so this logistic regression model performs well.

Building a Decision Tree

Accuracy is not a good indicator here since the majority of bookings were not cancelled.

Since we don't want people to cancel their bookings, we should use recall instead of accuracy as the metric for model evaluation.

2864 guests cancelled their booking when the model predicted that they would cancel. 6600 guests did not cancel their booking when the model predicted that they would not cancel.
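To make the difference between accuracy and recall concrete, the sketch below computes both from the four cells of a confusion matrix. The counts are made up for illustration (they are not the notebook's results), the positive class is assumed to be the outcome we want to catch, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "recall": tp / (tp + fn),        # share of actual positives that were caught
        "precision": tp / (tp + fp),     # share of predicted positives that are right
    }


# Made-up counts: positives are the minority class.
toy_metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
print(toy_metrics)
```

In this toy case accuracy is 0.85 while recall is only 0.60, which is why accuracy alone can hide a model that misses many positive cases.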

Do we need to prune the tree?

Recall on the training set has dropped from 1 to 0.81, but this is an improvement because the model is no longer overfitting and now generalizes better.

Visualizing the Decision Tree

Using GridSearch for Hyperparameter tuning of our tree model

Confusion Matrix - decision tree with tuned hyperparameters

The model performs slightly better on the test set than on the training set!

Visualizing the Decision Tree

Here we see that the importances of all features have decreased to zero. This model is severely underfit.

Cost Complexity Pruning

Total impurity of leaves vs effective alphas of pruned tree

This produces 1345 trees, one for each value in ccp_alphas.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.

As alpha increases, the tree becomes simpler and its depth decreases.
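The pruning steps described above can be sketched with scikit-learn's `cost_complexity_pruning_path`. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's actual code or dataset; the names `clfs` and `ccp_alphas` follow the text, and everything else (sample size, `random_state`, the clipping step) is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the hotel bookings data is not used here.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Effective alphas and total leaf impurities along the pruning path.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative alphas caused by floating-point round-off.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Fit one tree per effective alpha.
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)
    for alpha in ccp_alphas
]

# Drop the last tree and alpha, which correspond to the fully pruned tree.
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depths = [clf.tree_.max_depth for clf in clfs]
print("nodes:", node_counts[0], "->", node_counts[-1])
print("depth:", depths[0], "->", depths[-1])
```

Plotting `node_counts` and `depths` against `ccp_alphas` gives the kind of nodes-versus-alpha and depth-versus-alpha curves the text refers to.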

Accuracy vs alpha for training and testing sets

Both the training and test sets show a good fit, even though test accuracy never reaches the full 1.00.

Since accuracy isn't the right metric for our data, we want high recall instead.

Confusion Matrix - post-pruned decision tree

With post-pruning we get the highest recall on the test set and the smallest gap between training and test values.

Visualizing the Decision Tree

Model Performance Comparison and Conclusions

Although the decision tree with hyperparameter tuning has the highest recall, it consists of a single node and does not represent the data.
Apart from that tree, the decision tree with post-pruning has the highest test recall. It is the best of the decision tree models.

Actionable Insights and Recommendations